HTMSTRIP.DOC 1 Revised: 08-31-96
The HTMSTRIP.EXE program attempts to read HTML pages, remove the HTML coding,
and write the file out as something more useful. Features of this program:
* Can be run across an entire subdirectory (for example, your entire
cache subdirectory), and will only process the HTML documents that it finds.
(There are some options on this.)
* Removes all embedded HTML commands.
* Recodes the standard HTML "entity references" (e.g. "&copy;" becomes
"(c)"). The actual replacements are coded in a user-modifiable lookup file.
* Handles standard indent, heading, selection groups, menus, tables, etc.
* Reflows all text as appropriate.
* Optionally, will replace Link, Image, and Input references with
user-definable text representations.
* Optionally, alerts you to possible errors in the HTML code itself.
HTML codes are enclosed in <...> indicators. For upward compatibility
reasons, Web browsers ignore any codes that they don't understand and just
process the ones they can handle.
Note that HTMSTRIP is currently geared toward HTML 2.0 files plus the
Netscape table-specific extensions (added to HTML 3.0).
HTMSTRIP removes all HTML codes. It also handles the standard HTML "&xxx;"
"entity references" (e.g. "&copy;" is replaced by "(c)"). You can add or change
these replacements as desired by using the INI file (documented later).
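The two steps just described (dropping the <...> codes, then recoding entity references) can be sketched in a few lines of Python. This is only an illustration, not how HTMSTRIP.EXE is actually written: the names strip_html and ENTITIES are made up here, the table holds just a handful of entries, and leaving unknown references untouched is an assumption.

```python
import re

# Hypothetical lookup table for illustration; the real table ships in HTMSTRIP.INI.
ENTITIES = {"&copy;": "(c)", "&amp;": "&", "&lt;": "<", "&gt;": ">"}

TAG = re.compile(r"<[^>]*>")        # anything between "<" and the next ">"
ENTITY = re.compile(r"&#?\w+;")     # "&xxx;" or "&#nnn;"

def strip_html(text):
    """Remove <...> codes, then recode the entity references we know about."""
    without_tags = TAG.sub("", text)
    # Assumption: references missing from the table are passed through unchanged.
    return ENTITY.sub(lambda m: ENTITIES.get(m.group(0), m.group(0)), without_tags)

print(strip_html("<B>Copyright &copy; 1996</B>"))
```

Tags are removed before entities are recoded so that a recoded "&lt;" is never mistaken for the start of a tag.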
HTMSTRIP is also tuned to handle several embedded HTML codes specially. These
codes are the following:
<A ...>                         External link
<BLOCKQUOTE>...</BLOCKQUOTE>    Indented block of text
<BR>                            Forced line break
<CAPTION>...</CAPTION>          Title for a table
<CENTER>...</CENTER>            Centering text
<DD>                            Term definition
<DIR>...</DIR>                  Directory list of items
</DL>                           End of definition list
<DT>                            First term of definition list/glossary
<H1> to <H6>...</H1> to </H6>   Heading items
<HR>                            Horizontal rule
<IMG ...>                       Image
<INPUT ...>                     User input
<LI>                            Menu/Ordered/Unordered/Directory list item
<MENU>...</MENU>                Menu listing
<OL>...</OL>                    Ordered listing
<OPTION>                        Used for single/multiple choice menus
<P>                             Paragraph indicator
<PRE>...</PRE>                  Preserve spacing block (preformatted text)
<SCRIPT>...</SCRIPT>            JavaScript blocks are ignored
<SELECT>...</SELECT>            Block for single/multiple choice menu
<TABLE>...</TABLE>              Table block
<TD>...</TD>                    Table data (cell)
<TH>...</TH>                    Table heading
<TITLE>...</TITLE>              Title item
<TR>...</TR>                    Table row
<UL>...</UL>                    Unordered listing
If you run across other codes that become vital, let me know and I'll try to
handle them somehow.
How to get HTML files:
Some people who are using regular Web browsers like Mosaic or Netscape don't
realize that they're automatically saving HTML files to their hard disk
throughout every Web session. That's because just about every Web browser saves
the most-recently accessed files from the Web (including HTML source code,
GIF's, and JPG's) on your hard disk and reads them from there instead of
requiring you to download them every time you go back to a previous page. This
is typically settable by you under "Preferences" and "Cache" on your Web
browser.
I usually set my Web browser to have a huge cache, maybe 10MB. Anything beats
downloading the same pages over again even at 28.8K. And I make sure that I do
not have anything specified like "clear cache at the end of every session". Then
I just go through the files in the cache subdirectory afterward and reprocess
them.
Two disadvantages to a cache... It takes up hard disk space but, hey, the Web
browser is typically in Windows so why are you surprised. The second
disadvantage is that if the page actually changes between sessions, you
typically won't notice the new page as long as the old one remains in your
cache. If you think a cached page should have changed but hasn't, you can
typically ask your Web browser to reload the page. On some browsers, this is
shown as an arrow in the form of a circle.
HTMSTRIP can process the entire cache subdirectory. It automatically detects
non-HTML files for you and processes accordingly. It creates new text file
versions of just the HTML pages it finds.
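How HTMSTRIP tells HTML from non-HTML files is not documented here. One plausible approach, sketched below in Python purely as an assumption, is to peek at the start of each file and look for a common HTML tag. The names looks_like_html and find_html_files, the tag list, and the 2048-byte window are all invented for the example.

```python
import re
from pathlib import Path

# Assumed heuristic: an HTML page shows one of these tags near its start.
HTML_HINT = re.compile(rb"<\s*(!doctype\s+html|html|head|title|body)\b", re.IGNORECASE)

def looks_like_html(data):
    """Return True if the first 2048 bytes contain a telltale HTML tag."""
    return HTML_HINT.search(data[:2048]) is not None

def find_html_files(directory):
    """List the files in a cache directory that appear to be HTML pages."""
    found = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            with path.open("rb") as handle:
                if looks_like_html(handle.read(2048)):
                    found.append(path)
    return found
```

Checking content rather than file names matters for a browser cache, where files often carry generated names with no useful extension.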
By the way, the current beta version of Netscape typically ignores my cache
setting, and I don't have the slightest idea why. When you Alt-F4 out of
Netscape, it goes through and deletes all but a few of the temporary files.
This is annoying, to say the least. As a result, I have to run HTMSTRIP from a
DOS window just before leaving Netscape. If anyone knows why it does this to
me, please let me know!
Specifying parameters:
Parameters for this program can be set in the following ways. The last setting
encountered always wins:
- Read from an *.INI file (see BRUCEINI.DOC file),
- Through the use of an environment variable (SET HTMSTRIP=whatever), or
- From the command line (see "Syntax" below)
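The "last setting wins" rule amounts to merging the three sources in order. The Python sketch below assumes the sources are read in the order listed (INI file, then environment variable, then command line). The function name and the setting names "width" and "links" are hypothetical and are not actual HTMSTRIP parameters.

```python
def resolve_settings(ini_settings, env_settings, cli_settings):
    """Merge the three sources; a later source overrides an earlier one."""
    merged = {}
    # Assumed read order: INI file first, environment variable next, command line last.
    for source in (ini_settings, env_settings, cli_settings):
        merged.update(source)
    return merged

# Hypothetical setting names, for illustration only.
print(resolve_settings({"width": "80", "links": "off"}, {"width": "72"}, {"links": "on"}))
```

Under that assumption, anything typed on the command line overrides both the environment variable and the INI file.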
Defining entity references:
HTMSTRIP will process an entity reference definition file if one is found. This
table can be in your standard *.INI file (e.g. HTMSTRIP.INI) if desired or it
can be a separate file specified using the /Linitfile parameter.
Entity references are how non-standard characters like the copyright character
are handled in HTML pages. Entity references are indicated as "&xxx;" where
"xxx" is either a code or a number preceded by a pound sign. The copyright
symbol is indicated in HTML as "&copy;".
A default HTMSTRIP.INI is provided with over 230 entity reference lookups. To
define or change these lookups, the INI file should include a series of lines in
the following format:
&xxx; = outstr
where "&xxx;" is the HTML sequence and "outstr" is what you want to replace it
with. The "outstr" portion can consist of regular non-space ASCII text
characters (like "A" or "z") as well as hexadecimal values (in the form &Hxx) or
decimal values (in the form \nnn). (See the BRUCEHEX.DOC file.) It can also be
the word "NULL" which translates the string into nothing. You cannot use a
space or equal sign in "outstr"; use the hexadecimal or decimal representations
instead. The table does not have to be in any specified order. Lines can end
with "/*" followed by a comment if you want. Examples:
&copy; = (c) /* Copyright symbol
&deg; = °
&eacute; = é
&ecirc; = ê
&egrave; = è
&nbsp; = \032
Remember that "&xxx;" entity references (yes, I hate that phrase) are
case-sensitive in HTML. "&deg;" will not find "&Deg;".
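To make the lookup line format concrete, here is a rough Python sketch of reading one "&xxx; = outstr" line. It is not HTMSTRIP's own parser. It assumes &Hxx always carries exactly two hex digits and \nnn exactly three decimal digits (BRUCEHEX.DOC may define these differently), and it maps the values straight to characters with chr(), which glosses over DOS code pages. The function names are made up.

```python
import re

# Assumed escape forms: &Hxx (two hex digits) and \nnn (three decimal digits).
ESCAPE = re.compile(r"&H([0-9A-Fa-f]{2})|\\(\d{3})")

def decode_outstr(outstr):
    """Expand NULL, &Hxx and \\nnn in the replacement string."""
    if outstr == "NULL":
        return ""
    def expand(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)
    return ESCAPE.sub(expand, outstr)

def parse_lookup_line(line):
    """Return (entity, replacement), or None if the line is not a lookup."""
    body = line.split("/*", 1)[0].strip()      # drop a trailing "/*" comment
    key, sep, value = body.partition("=")
    key, value = key.strip(), value.strip()
    if not sep or not key.startswith("&") or not value:
        return None
    return key, decode_outstr(value)

print(parse_lookup_line("&copy; = (c) /* Copyright symbol"))
```

The key is kept exactly as written, so a table built from these pairs stays case-sensitive, matching the remark above.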
There seems to be a trend of late to relax some of the replacement coding
requirements in Web pages. The ";" is now, apparently, becoming optional.
Numeric replacements (e.g. "&#160;") seem to no longer require the leading pound
sign. Therefore, HTMSTRIP looks for both of these variations for any
appropriate lookup. "&copy;" will find "&copy" and "&#153;" will find "&153".
The lookup itself has to be entered in the formally correct way thoug